NSF PAR Search | NSF Public Access Repository

The Illusion of State in State-Space Models

Merrill, William; Petty, Jackson; Sabharwal, Ashish (July 2024, International Conference on Machine Learning (ICML) 2024)

State-space models (SSMs) have emerged as a potential alternative architecture for building large language models (LLMs) compared to the previously ubiquitous transformer architecture. One theoretical weakness of transformers is that they cannot express certain kinds of sequential computation and state tracking (Merrill & Sabharwal, 2023), which SSMs are explicitly designed to address via their close architectural similarity to recurrent neural networks (RNNs). But do SSMs truly have an advantage (over transformers) in expressive power for state tracking? Surprisingly, the answer is no. Our analysis reveals that the expressive power of SSMs is limited very similarly to transformers: SSMs cannot express computation outside the complexity class TC^0. In particular, this means they cannot solve simple state-tracking problems like permutation composition. It follows that SSMs are provably unable to accurately track chess moves with certain notation, evaluate code, or track entities in a long narrative. To supplement our formal analysis, we report experiments showing that Mamba-style SSMs indeed struggle with state tracking. Thus, despite its recurrent formulation, the "state" in an SSM is an illusion: SSMs have similar expressiveness limitations to non-recurrent models like transformers, which may fundamentally limit their ability to solve real-world state-tracking problems.
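
For readers unfamiliar with the permutation composition problem mentioned in the abstract, a minimal sketch in Python may help. It is an illustration only, not code from the paper: the function names `compose` and `track_state` are hypothetical, and the tuple encoding of permutations is an assumption. The point it shows is that the task is inherently sequential, since the running state after each step depends on every permutation seen so far.

```python
def compose(p, q):
    """Return the permutation 'apply p, then q' on positions 0..n-1.

    Permutations are encoded as tuples: p[i] is where position i is sent.
    (This encoding is an assumption made for illustration.)
    """
    return tuple(q[p[i]] for i in range(len(p)))


def track_state(perms, n):
    """Fold a sequence of permutations into one running state.

    Starts from the identity on n elements and composes each
    permutation in order, which is the state-tracking task in miniature.
    """
    state = tuple(range(n))
    for p in perms:
        state = compose(state, p)
    return state


# Two transpositions on three elements (hypothetical example inputs).
swap01 = (1, 0, 2)
swap12 = (0, 2, 1)

print(track_state([swap01, swap12], 3))
print(track_state([swap01, swap01], 3))
```

Applying the same swap twice returns the identity, while the two different swaps in sequence yield a 3-cycle, so the final answer depends on the order and content of the whole input sequence.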
Full Text Available
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases

Hu, Michael Y; Petty, Jackson; Shi, Chuan; Merrill, William; Linzen, Tal (July 2024, The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025))

Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. We experiment with pre-pretraining (training on formal language before natural languages) on transformers and find that formal languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language and better linguistic generalization compared to other formal languages. We also find modest support for the hypothesis that the formal language should fall within the computational limitations of the architecture. Strikingly, pre-pretraining reduces loss more efficiently than training on a matched amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. Finally, we also give mechanistic evidence of transfer from formal to natural language: attention heads acquired during pre-pretraining remain crucial for the model's performance on syntactic evaluations.
Full Text Available
How Abstract Is Linguistic Generalization in Large Language Models? Experiments with Argument Structure

https://doi.org/10.1162/tacl_a_00608

Wilson, Michael; Petty, Jackson; Frank, Robert (November 2023, Transactions of the Association for Computational Linguistics)

Language models are typically evaluated on their success at predicting the distribution of specific words in specific contexts. Yet linguistic knowledge also encodes relationships between contexts, allowing inferences between word distributions. We investigate the degree to which pre-trained transformer-based large language models (LLMs) represent such relationships, focusing on the domain of argument structure. We find that LLMs perform well in generalizing the distribution of a novel noun argument between related contexts that were seen during pre-training (e.g., the active object and passive subject of the verb spray), succeeding by making use of the semantically organized structure of the embedding space for word embeddings. However, LLMs fail at generalizations between related contexts that have not been observed during pre-training, but which instantiate more abstract, but well-attested structural generalizations (e.g., between the active object and passive subject of an arbitrary verb). Instead, in this case, LLMs show a bias to generalize based on linear order. This finding points to a limitation with current models and points to a reason for which their training is data-intensive.
Full Text Available
(QA)^2: Question Answering with Questionable Assumptions

Kim, Najoung; Htut, Phu Mon; Bowman, Samuel R.; Petty, Jackson (July 2023, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics)

Naturally-occurring information-seeking questions often contain questionable assumptions -- assumptions that are false or unverifiable. Questions containing questionable assumptions are challenging because they require a distinct answer strategy that deviates from typical answers to information-seeking questions. For instance, the question "When did Marie Curie discover Uranium?" cannot be answered as a typical when question without addressing the false assumption "Marie Curie discovered Uranium". In this work, we propose (QA)^2 (Question Answering with Questionable Assumptions), an open-domain evaluation dataset consisting of naturally-occurring search engine queries that may or may not contain questionable assumptions. To be successful on (QA)^2, systems must be able to detect questionable assumptions and also be able to produce adequate responses for both typical information-seeking questions and ones with questionable assumptions. We find that current models do struggle with handling questionable assumptions -- the best performing model achieves 59% human rater acceptability on abstractive QA with (QA)^2 questions, leaving substantial headroom for progress.
Full Text Available
Do Language Models Learn Position-Role Mappings?

Petty, Jackson; Wilson, Michael; Frank, Robert (January 2022, Proceedings of the 46th annual Boston University Conference on Language Development)
Gong, Ying; Kpogo, Felix (Ed.)
Full Text Available
Sequence-to-Sequence Networks Learn the Meaning of Reflexive Anaphora

Frank, Robert; Petty, Jackson (December 2020, Proceedings of the 3rd Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2020))
Full Text Available